An Optimized Sparse Approximate Matrix Multiply for Matrices with Decay 
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We present an optimized single-precision implementation of the Sparse Approximate Matrix Mul- 
tiply (SpAMM) [M. Challacombe and N. Bock, arXiv 1011.3534 (2010)], a fast algorithm for matrix- 
matrix multiplication for matrices with decay that achieves an O (n log n) computational complexity 
with respect to matrix dimension n. We find that the max norm of the error achieved with a SpAMM 
tolerance below 2 x 10~ 8 is lower than that of the single-precision SGEMM for dense quantum chem- 
ical matrices, while outperforming SGEMM with a cross-over already for small matrices (n ~ 1000). 
Relative to naive implementations of SpAMM using Intel's Math Kernel Library (MKL) or AMD's Core 
Math Library (ACML), our optimized version is found to be significantly faster. Detailed perfor- 
mance comparisons are made for quantum chemical matrices with differently structured sub-blocks. 
Finally, we discuss the potential of improved hardware prefetch to yield 2-3x speedups. 

Keywords: Sparse Approximate Matrix Multiply; Sparse Linear Algebra; SpAMM; Reduced Complexity 
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I. INTRODUCTION 

For large dense linear algebra problems, the compu- 
tational advantage offered by fast matrix-matrix multi- 
plication can be substantial, even with seemingly small 
gains in asymptotic complexity 1 . Relative to conven- 
tional multiplication which is O (n 3 ) , Strassen's algo- 
rithm |130| achieves O (n 28 ) , while the Coppersmith 
and Winograd method [IB] and the method of Williams' 
[136] are bound by O (n 23755 ) and O (n 2 - 3727 ) , respec- 
tively. For these dense methods, balancing the trade-off 
between cost, complexity and error is an active area of 
research J5T] [52j I154J . On the other hand, large sparse 
problems are typically handled with conventional sparse 
matrix techniques, with only small concessions between 
multiplication algorithms. 

Intermediate to these regimes, a wide class of problems 
exist that involve matrices with decay, occurring in the 
construction of matrix functions |18j . notably the ma- 
trix inverse [201 I50j , the matrix exponential [88] , and in 
the case of electronic structure theory, the Heaviside step 
function (spectral projector) [IT, 19, 34, 35, 1 0711118] , A 
matrix A is said to decay when its matrix elements de- 
crease exponentially, as \a>ij\ < c A' l_J ', or algebraically 





as \aij\ < c/(\i — j\ + 1) with separation \i — j\. In non- 
synthetic cases, the separation \i — j\ may correspond to 
an underlying physical distance \fi — r*,|, e.g. of basis 
functions, finite elements, etc. [17] . Note, matrices with 
decay are typically not sparse, but exhibit a structure 



FIG. 1: Space filling curves (SFC) map atoms in Cartesian 
space (A) onto an index that is locality preserving (B), leading 
to clustering of matrix elements due to decay and the under- 
lying physical system, shown for the sparsified case (C). In 
addition the locality preserving mechanisms of the SFC can 
be extended to 2-D (ij) data decomposition of matrices, (C), 
as well as three-dimensional decomposition (ijk) in product 
space of the matrix-matrix multiply (D). 
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1 Throughout, we refer to the asymptotic bounds of an algorithm 
in big-O notation when using the term "complexity", while the 
term "performance" is reserved for the actual runtime or cycle 
count of an implementation. Since performance is affected by 
the complexity of the underlying algorithm and computer and 
implementation specific details, complexity and performance are 
not necessarily the same. 



that can be exploited to yield an approximate sparse al- 
gorithm that exhibits a reduced complexity. 

Reordering the physical indices Fi along a locality pre- 
serving space- filling curve (SFC), such as the Hilbert [84] 
or the Peano curve [121] . maps elements that are close 
in three dimensions onto an ordered list [1231 11391 1143] , 
leading to effective clustering of submatrices by order of 
magnitude 35j . Figure n] illustrates this point for quan- 
tum chemistry applications where the atom indices of the 




FIG. 2: The decay of matrix element magnitudes of a con- 
verged spectral projector (density matrix) for a (H2O) 300 wa- 
ter cluster at the RHF/6-31G" level of theory (n = 7500), 
where the molecular geometry has been reordered with a 
space filling Hilbert curve. The different colors indicate dif- 
ferent matrix element magnitudes; red: [0, 10 -8 ); green: 
[l0~ 8 ,10~ 6 ); blue: [l0~ 6 ,10~ 2 ); violet: [l0~ 2 ,l], corre- 
sponding to approximate exponential decay. 



molecule shown in panel A are ordered along the SFC, 
shown in panel B, leading to the clustered structure of 
matrix element magnitudes shown schematically in panel 
C of Fig. [l] and specifically in Fig. [2] Extending SFC or- 
dering to the two dimensional vector space of matrix ele- 
ments Aij, shown in panel C, and the three dimensional 
product space of product contributions AikBkj, shown 
in panel D, extends the locality preserving properties 
of SFC to matrix element storage and matrix product 
execution order. Historically, complexity reduction for 
matrices with decay has been achieved through "sparsi- 
fication" or dropping of small matrix elements and the 
use of conventional sparse matrix kernels [2j)J [27J [5JJ1 1551 - 
IFJTl UOOj . Alternatives such as divide and conquer exist 
[9"S"1 11131 11511 I152J , where block diagonal subsystems are 
"glued" together, but these methods arc unable to re- 
tain long range coherence effects that can manifest in the 
spectral projector, show for example in Fig. [2] as cross- 
diagonal bands. Recently, the authors introduced a fast 
method for calculating the product of matrices with de- 
cay, the Sparse Approximate Matrix Multiply (SpAMM) 
[37j . which is different from truncation in the vector 
space, (ij) and (jk), in that it hierarchically eliminates 
insignificant contributions in the product space (ijk), 
leading to complexity reduction through a sparse prod- 
uct space. The recursive approach is similar to Strassen's 



algorithm [130] and also to methods of improving lo- 
cality across memory hierarchies pioneered by Wise and 
coworkers © ESI QQS1 IHSHng] . 

SpAMM belongs to a broad class of generalized TV-Body 
solvers, "fast" algorithms that achieve reduced complex- 
ity through hierarchical approximation, achieving in- 
creasingly broad applicability and high performance im- 
plementations with surprising commonalities. Related 
solvers include the database join operation [331 [Ml IHOl 
I124J . gravitational force summation [140111411 1143| . the 
Coulomb interaction [301 1126) . the fast Gaufi trans- 
form [TH [781 H2H1 USB [149], adaptive mesh refinement 
QH [2TJ [52], the fast multipole method gSJ [75rT77] and 
visualization applications [TOl [2J [33 [Ml EED IM [155] . 

SpAMM is similar to the 'H-matrix algebra of Hack- 
busch and co-workers [72J [HI] and the Hierarchically 
Semiseparable (HSS) representation of Chandrasekaran, 
Gu and co-workers [UJ |43], where off-diagonal sub- 
matrices are treated hierarchically as reduced rank fac- 
torizations (truncated SVD), typically structured and 
grouped to reflect properties of the underlying operators. 
For problems with rapid decay, truncated SVD may be- 
have in a similar way to simple dropping schemes. How- 
ever, SpAMM is different than the "H-matrix or the HSS 
algebra as it achieves separation uniquely in the product 
space and does not rely on a reduced complexity repre- 
sentation of matrices. For very slow decay, where SpAMM 
is ineffective, the "H-matrix or HSS algebra may certainly 
offer an alternative. 

In Ref. [37] , a naive implementation of SpAMM was de- 
veloped using recursion and a simple unoptimized 4x4x4 
multiplication at the lowest level. In this article we de- 
scribe an optimized, single-precision, non-recursive ker- 
nel of SpAMM that can be introduced at any level of a re- 
cursive scheme targeting full precision of the conventional 
SGEMM kernel. Precision of the SGEMM corresponds roughly 
to many approximate double-precision 0(n) schemes, 
which yield ss 7 digits in the total energy and « 5 digits in 
the forces, with an additional two-fold savings in time and 
space. The optimized algorithm includes a hand-coded 
assembly kernel using single-precision SSE instructions 
for 4x4x4 matrix products, demonstrating significantly 
higher performance than any naive implementation using 
vendor tuned BLAS kernels, while yielding superior error 
control. 

Our article is organized as follows: In Sec. [IT] we de- 
scribe briefly the SpAMM algorithm for the matrix mul- 
tiply. 



In Sec. Ill 



we describe in detail our high- 
performance implementation. In Sec. |IV[ we describe the 
benchmark setup and methodologies used. In Sec. [V] we 
present our results for benchmarks on representative den- 
sity matrices from quantum chemistry calculations per- 
formed on two hardware platforms, Intel Xeon X5650 and 
AMD Opteron 6168. In Sec. |VI[ we discuss the results 



and, finally, in Sec. VII we present our conclusions. 



II. ALGORITHM AND THEORY 

The SGEMM (single precision) function of BLAS calcu- 
lates the following expression, 



C = olAB + 0C, 



(1) 



where A, B, and C are m x k, k x n, and m x n ma- 
trices, respectively, and a and j3 are scalar parameters. 
SpAMM calculates the same but is optimized for matrices 
with decay. In SpAMM we store an to x n matrix in a 
quaternary tree (quadtree) data structure [55], an idea 
developed by Wise et al. [5J HH EB], Beckman [15], 
and Samet [1221 1123] . The quadtree resolves the ma- 
trix decay pattern hierarchically and is particularly well 
suited for the representation of matrices with a clustered 
structure such as that shown in Fig. [2] and explained in 
Fig.fll Furthermore, computational cost for quadtree ac- 
cess exhibits favorable scaling with respect to matrix size 

ma. 

Compared to the more traditional Compressed Sparse 
Row or Column (CSR or CSC, respectively) data struc- 
ture, which is typically stored in contiguous arrays, the 
quadtree can easily be allocated non-contiguously and 
allows for insertion and deletion without requiring data 
copy operations. Note that even in the case of more ad- 
vanced CSR data structures the pointer table is allocated 
contiguously, see for instance the BCSR data structure of 
Challacombe [35] which uses blocked data, or the doubly 
compressed sparse row format (DCSR) of Buluc, et al. 
[2"5] developed for the efficient representation of hyper- 
sparse matrices. 

In addition, CSR based implementations face chal- 
lenges in parallel, currently significantly suppressing per- 
formance scaling to large processor counts. In the context 
of electronic structure calculations for instance, it has 
been observed that data movement in parallel CSR sum- 
mation can dominate the parallel construction of sparse 
Hamiltonians |60| . Similarly, the parallel sparse matrix- 
matrix multiply (SpGEMM) of Buluc et al. [HI ISO] be- 
comes communication bound with increasing processor 
count due to the 2-D decomposition and random permu- 
tation of matrix rows and columns that simplify commu- 
nication and load balancing, but lead to communication 
cost scaling as in the dense (Cannon's algorithm [33] or 
SUMMA [134]) case, O (a y /p + ficn/y/p), where a rep- 
resents communication latency, /? communication band- 
width, and c the average number of non-zero elements 
per row. 

There is also a fundamental philosophical difference 
between SpAMM and CSR based matrix-matrix multiply 
algorithms with SpGEMM as a state of the art example. 
On the one hand, SpAMM exploits locality manifest in clus- 
tering of matrix elements by order of magnitude (Fig. [2| 
to achieve a reduced complexity through hierarchical ap- 
proximation in the product space. On the other hand, the 
parallel SpGEMM exploits randomization and correspond- 
ing loss of locality to achieve load balance. While Bulug's 



idea has been deployed by a large scale quantum chem- 
istry program [135 , the impact of data reordering and 
loss of locality on scalability would appear to also suffer 
from communication bottlenecks with same formal com- 
plexities as Bulug's. It remains for us to demonstrate 
that these complexities can be overcome in the SpAMM ap- 
proach by retaining locality and achieving load balance 
instead in overdecomposition of recursive task space. In 
addition, the SpGEMM approach would strongly suppress 
performance of quantum chemical algorithms due to to- 
tal loss of (matrix) locality, as well as interoperability 
problems due to global ordering and reordering of data. 
The non-contiguous allocation of a quadtree and the 
SFC ordering of matrix elements should therefore be 
more suited for integration with other quantum chemical 
solvers and in parallel implementations where data and 
work distribution in a non-shared memory environment 
is important and data repacking undesirable. 



A. Quadtree Data Structure 

Given an m x n matrix, the depth of the quadtree is 
given by 



'log (m/n b ) log (»/»&) \ 

d = T maX ,„,n ' i„,o H> 



log 2 



log 2 



(2) 



where n& x n& is the size of the dense submatrices stored 
at the bottom of the tree, chosen for performance. To 
simplify recursion, the matrix is zero padded to square 
shape of size n p = n^ 2 d . The computational cost of 
the lookup, insertion, and deletion of matrix elements is 
O (log Tip), independent of the decay pattern. The root 
tier is denoted as t — and at each tier t < d of the 
quadtree, a node links to the subnodes of the 2x2 sub- 
matrix underneath it, 
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(3) 



If a submatrix is zero, its link is set to NULL and the 
tree terminates at that point. Finally, at the bottom tier 
t = d, each node stores a dense rib x n& submatrix. 



B. SpAMM 

In each node we also store the Frobenius norm of its 
submatrix, which is given by 



\ 



hj=i 



Krwh 



(4) 



Since the square of the Frobenius norm is additive in its 
submatrix norms we can construct it recursively start- 
ing at the bottom tier. The matrix product is defined 



recursively at each tier as 



fe=i 



subject to the SpAMM condition 
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(5) 



(6) 



where r is the SpAMM product tolerance. 



C. Matrix Sparsity vs. Product Sparsity 

Matrices with algebraic or exponential decay are not 
necessarily sparse unless elements are dropped (sparsi- 
fied). Storage of dense matrices is O (n 2 ), while lookup 
and insertion/deletion is O (log n) . In practice though, 
with r > 0, truncation in the product space leads nat- 
urally to low level truncation in the vector space of the 
result, C. Likewise, it is possible to apply a very small 
threshold to the product matrices A and B to maintain 
an O (nlogn) computational complexity of the multiply 
(for further discussion, see Sec. VI). In particular, using 
a drop tolerance e, with 

e= max(||A|| F ,||B|| F )' ?il 

is numerically consistent with the SpAMM condition. The 
accumulation of error due to low level sparsification and 
product space truncation are certain to be application 
specific, with a detailed analysis beyond the scope of the 
current work. Here we consider only errors associated 
with application of the SpAMM condition to full (dense) 
matrices. 



III. IMPLEMENTATION 

Figure [3] shows the performance of tuned SGEMM kernels 
from MKL [3 and ACML [T] with increasing matrix size. For 
small matrices, the performance is mostly throttled by 
memory access since memory latency is very large with 
f=a 60-100 ns (at 2.8 GHz, this translates into roughly 
168-280 cycles) per memory access to main memory [98l 
Table 2] , compared to the latency of an addition or mul- 
tiplication operation, taking only 3 or 4 cycles [57J Xeon 
X5650, Appendix C], respectively. In order to mitigate 
this latency, optimized, cache-aware SGEMM implemen- 
tations use detailed knowledge of the cache hierarchy 
and data reordering techniques, such as partitioning and 
packing into contiguous buffers [fJSHTOl ISO"] , to replace 
main memory access with cache access, which is signifi- 
cantly faster (access to LI data cache takes ~ 10 cycles). 



A. Cache-Oblivious Algorithms 

A class of alternative approaches is aimed at avoiding 
knowledge of the cache hierarchy altogether through im- 
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FIG. 3: The performance of SGEMM of the MKL and ACML li- 
braries on AMD Opteron 6168 and Intel Xeon X5650. Clearly 
shown is the slow ramping up of performance with matrix 
size. For comparison the average performance on a dense 
16 x 16 submatrix of our numeric multiply is shown as hori- 
zontal lines. 



proved data locality, and is typically referred to as cache- 
oblivious methods OH [57j [Ml LZll IH2 ED LlQSl QUI HT71 
11331 I148J . Typically, a heuristic such as a SFC is applied 
to storage and computation ordering as shown in Fig. [T] 
(C) and (D), which yields excellent locality for both, and 
the newly gained locality obviates cache-aware methods 
employed by standard dense methods. 

Cache-oblivious algorithms effectively mitigate the is- 
sue of latency between memory hierarchies, and would 
seem to be a natural way to deal with sparse irreg- 
ular problems such as SpAMM. However, fine grained 
cache oblivious methods neglect to address the hardware 
prefetcher, which can have significant impact on applica- 
tion performance [5411119] and are found to underperform 
cache-aware approaches for dense linear algebra problems 

[7jrr53]. 



B. Hardware Prefetch 

The prefetcher predicts the memory access pattern and 
preloads the anticipated addresses into cache without 
software intervention. Currently though, the algorithms 
employed by the prefetcher are geared towards regular 
contiguous problems such as dense linear algebra and re- 
quire fixed stride memory access patterns. In addition, 
prefetching will not cross a 4 KiB page boundary and 
the stride has to be within a very short threshold value, 
which depends on the processor implementation, but is 
typically 256 or 512 bytes. The prefetcher is triggered 
after two successive cache misses in the last level cache 
with address distances within the threshold value [571 
Sec. 2.4.4.4]. 

In an early implementation of SpAMM using cache- 
oblivious methods, we concluded that the effects of the 



hardware prefetcher are significant. Our current under- 
standing is that fine grained cache-oblivious techniques 
introduce variable stride access patterns, which fail to ac- 
tivate the hardware prefetch unit (two successive cache 
misses with constant stride are necessary), leading to 
performance degradation relative to conventional cache- 
aware implementations. This understanding is corrobo- 
rated by the findings of Bader et al [12] who report poor 
performance of their cache-oblivious algorithm for DGEMM 
at fine granularities. While it may be possible to im- 
prove fine-grained performance through software prefetch 
statements, to the best of our knowledge this remains a 
challenging, unrealized goal for irregular algorithms. 

In summary, while it is desirable to apply the SpAMM 
condition, Eq. (roll, aggressively, fine-grained irregular 
memory access precludes the use of hardware prefetch 
in its current form, favoring a coarser granularity. Our 
kernel strikes a compromise between these considerations 
by implementing storage and computation at different 
granularities. We retain a quadtree structure down to a 
granularity of 16 x 16 and divide each dense block into 
4x4 submatrices, stored in row-major order but multi- 
plied under the SpAMM condition. While the required con- 
ditionals in the kernel degrade performance only slightly 
in the dense case they lead to increasingly effective per- 
formance as measured by the inverse of time-to-solution 
(see Appendix [A]), and excellent error control with spar- 
sity. In the following, we detail this strategy. 
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FIG. 4: A schematic representation of the construction of 
linear quadtree indices and where those indices are in the 
original matrix. Shown are the linear indices of first tier for 
the matrix quadrants II, III, and IV. The linear indices of the 
second tier are shown for quadrant I. 



C. Recursive Implementation 

An unoptimized version of SpAMM is straightforward to 
write using recursion and a conventional, vendor tuned 
kernel for the block multiply. For all matrices with decay 



in our tests (see Sec. IV B) we found the fastest time to 



solution for this recursive approach and a conventional 
SGEMM kernel to occur for 16 x 16 blocking. 



D. Fast Implementation 

We divide the SpAMM algorithm into two parts, a sym- 
bolic part that collects the submatrices to be multiplied 
and a numeric part that multiplies the submatrices, de- 
scribed separately in the following. 



1. Symbolic Multiply with Linkless Tree 

Tree data structures and recursive implementations on 
trees can be inefficient in languages such as C/C++ or 
Fortran. While the case of tail- recursion can be inlincd by 
many modern compilers, the general case is significantly 
more challenging and may not always allow for optimiza- 
tion [1281 1132] . Hashed (linkless) tree structures offer a 
fast and efficient alternative. Instead of storing explicit 
links, the nodes are stored in hash tables by a linearized 



index, as described by Morton |lllj and Gargantini [61] 
and the excellent book by Samet [122] . This technique 
is often employed in generalized jV-Body solvers, and 
in particular in the computer graphics realm for colli- 
sion detection and culling [H Hfl ED [Ml WH [Ml CESS], 
and in database applications such as the join operation 
PHlllinilini], with some of these implemented on GPU 
systems. 

Each tree node is identified by its upper left hand sub- 
matrix row- and column indices, i and j. In combination 
with the matrix dimension and the tier depth, the full 
range of matrix indices covered by a node can be readily 
reconstructed. The linear index I is calculated through 
dilating and interleaving i and j: 

i = Y, *™ 2m ( 8 ) 

m=0 

m=0 

I = J2 [i m 2 2m+1 + j m 2 2m ] . (10) 

m=0 

For illustration, Fig. [2] shows the first two tiers of lin- 
ear matrix indices of a small matrix. Notice how each 
tier adds two digits from the right to the linear index 
of its parent node and the quadrant labeling scheme is 
recursively repeated. Despite the lack of links between 
tree nodes, the full tree structure and its hierarchy can 
simply be reconstructed on the way down by adding two 



digits from the right, or on the way up by removing those 
two digits. 

It is worth pointing out that for performance the choice 
of hash function is extremely important. Excessive col- 
lision rates are to be avoided, since hash table perfor- 
mance degrades rapidly with increasing collision rate. 
We use the well designed 32-bit hash function of Jenk- 
ins [50], which limits the tree depth to 16 + 4 (since we 
store 16 x 16 dense submatrices) and the matrix size to 
1048576 x 1048576, more than sufficient for the tests pre- 
sented here. Larger matrices are possible through the 
use of either a larger hash key or a hybrid approach of 
linked tree (at the top) and linkless tree (at the bottom 
20 tiers). 



2. Symbolic Multiply 



fc-value block, the indices are sorted by their associated 
node norms (the Frobenius norm stored at each node, see 
Sec. II B for details) in descending order. Alternatively 
both sort passes can be combined into one pass by rewrit- 
ing the sort comparison function to include both fc-indcx 
and norm, however we have not found any performance 
benefit from this approach. 

Algorithm 1 Linkless tree SpAMM multiply. 



Extract all keys (linear matrix indices) from A and B. 
Sort key lists on k, i.e. Aik and Bhj. 
for all k in key lists do 

Sort sublist by descending norm. 
end for 

Convolve and find A, B, and C submatrices, Eq. ( |12[ ). 
for all products g product array do 

end for 



The convolution of the two-dimensional vector spaces 
into the three-dimensional product space, panels (C) and 
(D) of Fig. n] is straightforward to implement through 
recursively applying Eq. (pM), but the situation is more 
complicated in the case of linkless trees. The hash tables 
of A and B have to be searched and matching subma- 
trix pairs An- and B^j have to be found, conditional on 
Eq. (pi), which result in a contribution to the resulting 
submatrix Cij. Related problems are common amongst 
generalized TV-Body solvers, such as performing queries 
on hashed trees to join [9], collide [10], add [142J . or in 
other ways combine elements. 

The collection of the submatrices that contribute to 
the product is carried out in several steps. First, the 
linear indices / for A and B are extracted from the hash 
tables of the lowest tier (16 x 16 submatrix level) and 
stored in an array. From Eq. ( |10[ ) the indices are given 
bit-wise by: 



I A. 



. i 2 &2*i&i*ofco- 



(11) 



Second, in a first pass, the index arrays are sorted on their 
/c-index values using the merge sort algorithm 2 , resulting 
in an ordered array partitioned into blocks with identical 
k- values; "A:- value blocks" . Application of the bit-mask 
Ob. . .010101 to lA,ik and 0b. . .101010 to l B ,k 3 renders 
this step straightforward. In a second pass, within each 



2 Although quicksort is often found to be faster than merge sort 
in practice, it turns out to be a poor choice in our case: Sorting 
n elements, merge sort exhibits average and worst-case cost of 
O (nlogra), while in comparison quicksort exhibits the same av- 
erage cost, but slows down to O (n 2 ) in the worst-case. For the 
matrices we used in our tests, quicksort tended to deviate from 
logarithmic towards quadratic behavior, and merge sort consis- 
tently performed better. We speculate that a different choice of 
hash function could improve the performance of quicksort and 
the sorting steps in the symbolic part, but the design of a good 
hash function which exhibits few collisions and preserves loga- 
rithmic quicksort behavior is beyond the scope of this work. 



Third, the convolution is performed by nested loops over 
the fc-value blocks of the arrays. Although our algo- 
rithm only considers the lowest tier, rendering it non- 
hierarchical, norm sorting the k-vahie blocks restores 
O(nlogn) or O (n 2 log n 2 ) complexity for sparse and 
dense matrices, respectively. We implemented two ver- 
sions: The first consists of straightforward nested loops 
("nested convolution"), while the second is partially un- 
rolled and uses Intel's SSE intrinsics ("unrolled convolu- 
tion" ) that leads to a speed-up of close to three. Since 
SSE instructions require proper alignment of memory, we 
align each fc-value block on a 16-byte boundary and the 
index computations, the norm products, and the product 
comparisons are done in SSE, which reduces the loops to 
a stride of four. We note that the index extraction from 
the hash tables leads to O (n 2 ) storage for the indices 
of dense matrices. While this is not prohibitive for the 
matrices tested here, it may become so for larger matri- 
ces if not ultimately sparsified. Pointers to the nodes of 
each triple of submatrices An- , B^j , and CV, are then ex- 
tracted from the hash tables and appended to an array 
for further processing by our numeric multiply, described 
in the following section. The Cy submatrix indices are 
constructed by masking out the i and j index values of 
Ia,iH and ls,kj and combining the two masked indices, 

lc ij = {I a ik & 0b. . . 101010) | (l B kj & 0b. . . 010101) . 

(12) 
Steps 1-6 of Algorithm [l] show pseudocode of the convo- 
lution step. 

While storing the nodes of the linkless tree in a 
hashtable is very efficient for sparse matrices, a simple 
array is faster for dense matrices. Since the matrices used 
in this work are all dense (albeit with the aforementioned 
decay properties), it is educational to consider a perfor- 
mance comparison between hashtable and array storage 
of the indices. We therefore benchmarked two implemen- 
tations of the linkless tree: Hashtable storage of matrix 
nodes with the nested convolution implementation (la- 
beled "hashtable" in Fig. [5| and array storage of matrix 



nodes with the unrolled convolution implementation us- 
ing SSE intrinsics (labeled "SSE" in Fig. [5). 



3. Numeric Multiply 

Steps 7-9 of Algorithm [l] show pseudocode for the 
numeric multiply, and how it serially processes subma- 
trix products. Several factors need to be considered for 
optimal performance of the kernel: On the one hand, 
Streaming SIMD Extension (SSE) instructions exhibit 
higher performance than comparable x87 instructions 
[571 Sec. 3.8.4]) and in single-precision the SSE vector 
length is 4, suggesting a granularity of 4 x 4 for applying 
Eq. (TgJ) . On the other hand, this 4x4 level of granu- 
larity leads to inefficient cache use and memory access 
which can only be remedied by a larger contiguous data 
structure. Through experimentation, we found 16 x 16 
contiguous blocks to give the best overall performance. 
Finally, optimal data storage and access patterns had to 
be identified. We considered row-major, Morton order, 
and hierarchical versions of these and found that sim- 
ple non-hierarchical row-major storage and access is the 
most efficient. 

A note on the hardware prefetch: The 16 x 16 single- 
precision submatrices occupy 16 cache lines (or more as 
in our SSE version described below), which is sufficiently 
large for the hardware prefetch to accelerate memory ac- 
cess. The conditionals around the 4x4 block multiplies 
do not significantly degrade performance by themselves, 
but when products are skipped due to Eq. (mb, mem- 
ory access becomes irregular and prevents the hardware 
prefetcher from activating, which degrades the flop rate. 
At the same time however, the effective performance, the 
inverse of time-to-solution, increases since more products 
are dropped, more than offsetting the loss of hardware 
prefetch. 



Algorithm 2 SSE version of 4 x 4 multiply. 



register [1 
register [2 
register [3 
register [2 
register [1 
register [2 
register [3 
register [2 
register [1 
register [1 



[0000] 

[An An An An] 
[Bn B12 Bi 3 B14] 
register [2] o register [3] 
register[l] + register[2] 
[A12A12A12A12] 
[-B21 B22 B23 B24] 
register [2] o register [3] 
register[l] + register[2] 
[Cn C12 C13 C14] 



We implemented two versions of the kernel in assembly 
using the SSE and the SSE4.1 instruction sets. The Xeon 
X5650 supports both instruction sets, while the Opteron 
6168 only has support for SSE. The SSE instruction set 
was introduced by Intel in 1999 with their Pentium III 
processor series, adding 70 instructions and eight regis- 
ters (sixteen on 64-bit processors). Lacking a dot prod- 
uct, our SSE kernel relies on a dilated storage scheme for 



the matrix elements of A, in which each matrix element is 
stored 4 times, i.e. in vectors such as [An An An An]. 
This storage scheme allows for the efficient use of SSE 
registers without having to shuffle or copy the result ma- 
trix elements in SSE, leading to a performance improve- 
ment despite the increased demand on memory transfer. 
We show pseudocode of this kernel version in Algorithm 
[2] Note that the SSE multiplication and addition instruc- 
tions used here operate element-wise with the Hadamard 
or Schur product denoted by the symbol o. 

The SSE4 instruction set was introduced by Intel with 
their Penryn processor core in 2006. A subset of SSE4, 
SSE4.1, introduces dot product instructions for packed 
single-precision numbers and the dilated storage scheme 
of the SSE version becomes unnecessary. The matrix ele- 
ments of B are stored in column-major order. Compared 
to the SSE version, SSE4.1 allows for a reduction of data 
movement and it uses a dedicated dot product instruc- 
tion. However, we did not find this version to be faster 
than the SSE version on the Intel Xeon system. 



IV. EXPERIMENTAL METHODOLOGY 

In the following, we give a detailed description of our 
experimental methodology. In Sec. IV A we will describe 



in detail the hardware platforms used, and in Sec. IV B 
the benchmarks used to assess the performance of our 
implementation. 



A. Platforms 

All performance measurements were conducted on two 
hardware platforms, Intel Xeon X5650 (Westmere mi- 
croarchitecture, 2.66 GHz, 6 cores, 32 KiB of LI data 
per core, 256 KiB of L2 per core, 12 MiB L3 shared 
by all cores) and AMD Opteron 6168 (Magny-Cours mi- 
croarchitecture, 1.8 GHz, 12 cores, 64 KiB LI data per 
core, 512 KiB L2 per core, 12 MiB L3 shared by all 
cores), using a NUMA aware Linux kernel (C0NFIGJJUMA 
and C0NFIG_X86_64_ACPIJJUMA) version 2.6.39 on Gen- 
too [51]. Cycle and flop counts where taken with the 
PAP I library [TJ [TTJ H2D] version 4.1.4 3 . Throughout, 
gec [63) version 4.4.5 was used for compilation. It should 
be noted that the choice of compiler has very little im- 
pact on the results of this study because all time intensive 
computational kernels either are linked in from externally 
compiled libraries (MKL and ACML) or are written in assem- 
bly. Care has to be taken for getting accurate and repro- 



3 Cycle count: PAPI_T0T_CYC; Flop count, 

AMD: RETIRED _SSE_0PERATI0NS : 0P_TYPE : ALL, 

Intel: FP_C0MP_0PS_EXE : SSE.FP .PACKED, 

FP_CQMP_0PS_EXE:SSE_FP -SCALAR; Cache misses: 

DATA_CACHE_MISSES 



ducible measurement results [144] ; the benchmark pro- 
cess was locked to a processor core and memory allocation 
was limited to the local NUMA node using the numactl 
command in combination with the — physcpubind and 
— membind command line options. In addition all allo- 
cated memory was pinned to prevent paging using the 
mlockallO system call. 



B. Benchmarks 

In a typical quantum chemistry calculation, the solu- 
tion of an integro-differential equation is sought by ex- 
panding the electronic wavefunctions of a molecule such 
as the one shown in panel (A) of Fig. [T] into a set of 
atomic basis functions yielding a matrix eigenvalue prob- 
lem Q3I], or spectral projection [171 H51 l34l 1351 H07l HT8] . 
The exact form and number of functions in the basis set 
and hence the number of expansion coefficients per atom 
depend on the atom type and the basis set. Because SFC 
ordering is applied atom- wise, matrix elements are clus- 
tered by magnitude, grouped in submatrices given by the 
atom type and basis set. We explore the effect of subma- 
trix size with different basis sets, STO-2G and 6-31G**, 
for quantum chemical matrices corresponding to Hilbert 
curve ordered water clusters. These basis sets introduce 
submatrices of different sizes (STO-2G: H 1 x 1, O 5 x 5; 
6-31G**: H 5 x 5, O 15 x 15), representing the native 
granularities embedded in the matrix structures. The 
matrices were generated at the Restricted Hartree-Fock 
(RHF) level of theory with FreeON, a suite of programs 
for 0{N) quantum chemistry [25] and are fully converged 
and without sparsification (e = 0), i.e. the matrices are 
fully dense but exhibit the approximately exponential 
decay pattern shown in Fig. [2] The sequence of water 
clusters corresponds to standard temperature and pres- 
sure and have been used in a number of previous studies 
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The test calculations consist of taking the matrix 
square, a key step in construction of the spectral projec- 
tor (density matrix purification) [54 ] ITO71 ITU ITT51 1TT5] . 
The error of the product is measured by the max norm of 
the difference to the double-precision (DGEMM) product: 



TABLE I: Performance and error benchmarks 



IIACH, 



max • 

ij 
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DGEMM 

ij 



}■ 



(13) 



which places an upper bound on the element-wise error 
and is a simple check on the accuracy of the multipli- 
cation. For the performance comparisons the tolerance 
was adjusted to match the errors. While the exact de- 
pendence of the product error on r is currently unknown, 
it is worth pointing out that due to its application in 
product space, r affects which product contributions are 
dropped linearly, as opposed to the quadratic dependence 
on e for sparsification [37] . Performance is reported us- 
ing three metrics, the measured flop rate, the effective 
performance, and the cycle count as reported by PAPI. 
For reference, the effective performance is proportional 



Test 



Label 



1. Comparison of convolution imple- Symbolic Multiply 
mentations 

2. Dense multiply with SGEMM from SGEMM 
MKL and ACML (Fig. [I 

3. SpAMM numeric multiply at 4 x 4 SpAMM(4 x 4) 

4. SpAMM numeric multiply at 16x16, SpAMM(16 x 16) 
no conditionals 

5. SpAMM multiply at 16 x 16 level, SpAMM(SGEMM) 
SGEMM 



to the inverse of the execution time, see Appendix [A] for 
details. 

Test 1 consists of a comparison between the different 
convolution implementations, described in Sec. |IIID 2| 
Test 2 consists of matrix multiplications with vendor 
tuned SGEMM for varying matrix sizes to asses the per- 
formance of our numeric multiply (Sec. HID 3). The 
other tests consist of matrix multiplications with SpAMM 
to assess the performance and product error of SpAMM. All 
SpAMM tests use the unrolled convolution implementation 
with SSE intrinsics, as described in Sec. |IIID 2[ for the 
symbolic multiply since it proved to be the fastest, and 



different versions of the numeric multiply. The numeric 
multiply used in test 3 applies the SpAMM condition at a 
4x4 submatrix level, which we consider our target gran- 
ularity. The kernel in test 4 uses a version of the numeric 
multiply in which the SpAMM conditionals are commented 
out, which effectively means that the SpAMM condition is 
applied at a 16 x 16 submatrix level. Note that the ex- 
ecution flow is identical to test 3. The numeric multiply 
used in test 5 is a vendor tuned SGEMM at a 16 x 16 sub- 
matrix level. A summary of all benchmark tests can be 
found in Table H 



V. RESULTS 

Our results are qualitatively identical on the AMD and 
the Intel platforms and we will present only the AMD 
results in the following. In addition, we did not find any 
difference in the error between MKL and ACML and report 
for the error, Figs. [6] and [7J only the MKL results. 



A. Symbolic Multiply 

Fig. [5] (test 1 of Table IT]) shows a cycle count compar- 
ison of the symbolic part of SpAMM(4 x 4, r = 2 x 10~ 8 ) 
between the recursive and the two linkless tree versions 
(see Sec. HID 2 1. For a matrix of n = 7500, the unrolled 



convolution using arrays is almost 4.5 times faster than 
the recursive version and about 2.5 times faster than the 
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FIG. 5: The cycle counts of the symbolic part of SpAMM 
with a tolerance of r = 2 x 10~ 8 for RHF/6-31G" water 
cluster for the recursive multiply and the two linkless tree 
implementations. 



nested convolution using hashtables. At this matrix size, 
the symbolic part accounts for about 12% of the total 
cost of the multiply. 



B. Numeric Multiply vs. SGEMM 

Figure [3] (test 2 of Table p} shows the performance of 
SGEMM on the Intel and AMD platforms from MKL and 
ACML. The performance slowly ramps up and asymptotes 
to 12.6 Gflop/s for ACML and MKL on AMD and 12.7 
Gflop/s for ACML and 19.8 Gflop/s for MKL on Intel. Also 
shown in the figure the average performance of our nu- 
meric multiply on a dense 16 x 16 block for large matrices, 
exhibiting a performance of 8.1 (64% of peak ACML/MKL) 
and 13.7 (69% of peak MKL) Gflop/s on AMD and Intel, 
respectively. At a matrix size of 16 x 16, the performance 
ratio between the SpAMM kernel and SGEMM is 162% (5.0 
Gflop/s for MKL and ACML) for AMD and 161% (8.5 for 
MKL) for Intel. 



C. Errors 

In order to facilitate a fair comparison between tests 
3, 4, and 5 (Table [jj), appropriate tolerance values were 
found to match the matrix errors, Eq. (13 1, in the dif- 
ferent tests. In Figs. [6] and [7] the matrix errors of the 
test calculations for the two basis sets are shown. The 
differences in error between tests 4 and 5 were found to 
be insignificant and results of test 5 are not shown for 
clarity in these figures. For comparison, a dense multi- 
ply exhibits an error of around 10™ 6 and we note that 
SpAMM(4 x 4, r = 0) achieves a smaller error than SGEMM 
for both basis sets, which is likely due to differences in 
execution order; hierarchical vs. row-column. For the 
following tests of SpAMM we chose two error targets, 10 -6 
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FIG. 6: Error comparison as measured by the max norm, 
Eq. (B), for water clusters of different sizes in RHF/STO-2G 
on AMD. 
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FIG. 7: Error comparison as measured by the max norm, 
Eq. ( |13[ ), for water clusters of different sizes in RHF/6-31G** 
on AMD. 



and 10 -5 , and found the tolerance values shown in Ta- 
ble [n] Increasing the granularity from 4 x 4 to 16 x 16 
requires an increase of the tolerance by about a factor 
of 4 for the STO-2G basis set, and a factor of 5 for the 
6-31G" basis set. 



D. Effective Performance of SGEMM vs. SpAMM 

Comparing the effective performance of SGEMM, 
SpAMM(4x4, r = 2xlCT 8 ) and SpAMM(4x4, r = 2xlCT 7 ), 
for the RHF/6-31G** water clusters in Fig. [8} we find an 
approximately quadratic performance increase of SpAMM 
with respect to matrix size. Fits to a quadratic are shown 
as solid lines. The SGEMM performance is matrix size in- 
dependent above n ~ 1000 due to memory bandwidth 
saturation and the O (n 3 ) complexity of the standard 
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TABLE II: The tolerance values identified for the water clus- 
ters at different SpAMM granularity and errors. 



basis set error granularity tolerance 



STO-2G 



10" 



4x4 
16 x 16 



1.0 x 10" 
4.0 x 10" 



4x4 1.0 x 10" 6 
16 x 16 3.5 x 10" 6 



10" 



6 4x4 2.0 x 10~ 8 

6 _ 31G « 16x16 1.0 xlO- 7 

5 4x4 2.0 x 10" 7 
16 x 16 1.0 x 10~ 6 



SpAMM, 4x4, 1.0e-7 
SpAMM, 4x4, 1.0e-6 
SpAMM, 16x16, 4.0e-7 
SpAMM, 16x16, 3.5e-6 
SpAMM, MKL, 4.0e-7 
SpAMM, ACML, 4.0e-7 
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FIG. 8: Effective performance of SpAMM(4 x 4, r = 2 x 10~ 8 ) 
andSpAMM(4x4, r = 2xl0~ 7 ) and SGEMM for the water clusters 
in RHF/6-31G** on AMD. Solid lines show fits to second 
order polynomial. 



FIG. 9: Performance comparison for water clusters of differ- 
ent sizes in RHF/STO-2G on AMD. Solid lines show fits to a 
line. Shown in the inset, the effective performance in Gflop/s 
compared to SGEMM from MKL and ACML (thick line). 
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matrix multiply. We note the early cross-over of SpAMM 
at around n ~ 2000 and the fact that SpAMM performs at 
approximately 50% of peak SGEMM performance as n — > 
consistent with the results shown in Fig. [3J 



FIG. 10: Performance comparison for water clusters of dif- 
ferent sizes in RHF/6-31G** on AMD. Solid lines show fits 
to a line. Shown in the inset, the effective performance in 
Gflop/s compared to SGEMM from MKL and ACML (thick line). 



E. Performance of SpAMM 



Figures [9] and 10 show the measured performance of 
tests 3, 4, and 5 of Table [I] for the tolerance values of 
Table |TTJ Fits to a line are shown to highlight the ap- 
proximate linear scaling behavior of SpAMM and to give 
an indication of slope. The cycle count, which is pro- 
portional to CPU time, is shown in the main graph, and 
the effective performance in the inset. For comparison 
the performance of SGEMM is shown in the inset as a thick 
line. 

For the smaller basis set, SpAMM(4 x 4) is about 1.2- 
1.4 times faster compared to SpAMM(16 x 16), with this 
gain diminishing with increasing tolerance. However, no 
such performance difference is observed for the larger ba- 



sis. SpAMM(SGEMM) is significantly slower for both error 
targets with MKL surprisingly outperforming ACML, the 
CPU vendor's library. SpAMM(4 x 4) and SpAMM(16 x 16) 
achieve approximate linear scaling at n« 600 in STO- 
2G and n w 2000 in 6-31G**. For STO-2G, cross- 
over between SpAMM and SGEMM occurs at n « 300 for 
SpAMM(4 x 4), n w 500 for SpAMM(16 x 16), and n « 900 
for SpAMM(SGEMM, r = 4.0 x 10~ 7 ). For 6-31G**, cross- 
over occurs at n sa 2250 for SpAMM(4 x 4, r = 2.0 x 10~ 8 ) 



and SpAMM(16 x 16, r = 10 



-7\ 



n w 1250 for SpAMM(4 x 4, 
and 



t = 2.0 x 10~ 7 ) and SpAMM(16 x 16, r = 10~ 6 ) 
n sw 3500 for SpAMM(SGEMM, r = 1.0 x 10~ 7 ). 

Figures [TT] and p~2] show performance ratios of tests 
3 and 4 of Table |I] for a target error of 10~ 6 . Shown 
are the ratios of complexity (Ciq/C^), measured flops 
(flopi6/fiop4), and cycle count (Tig/T^), where we define 
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FIG. 11: Performance ratios of SpAMM(4 x 4, r = 1.0 x 
10~ 7 ) and SpAMM(16 x 16, r = 4.0 x 10~ 7 ) for water clusters 
of different sizes in RHF/STO-2G on AMD. Shown are the 
complexity ratio, Cia/d (the number of 4 x 4 x 4 matrix 
products), the flop ratio, /ie//4 (measured), and the cycle 
count ratio, T\q/Ta. 
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FIG. 12: Performance ratios of SpAMM(4 x 4, r = 2.0 x 10" 8 ) 
and SpAMM(16 x 16, r = 10~ 7 ) for water clusters of different 
sizes in RHF/6-31G** on AMD. 



the complexity as the number of 4 x 4 x 4 matrix multi- 
plications executed, and one 16 x 16 x 16 multiplication 
translates into a complexity of 64. For the largest water 
cluster, the complexity ratio is about 4.5 for STO-2G and 
2.5 for 6-31G**, the flop ratio is about 3.7 for STO-2G 
and 2.1 for 6-31G**, and the cycle ratio is about 1.3 for 
STO-2G and 1.1 for 6-31G**. Note that the difference 
between complexity and flop ratios is due to the norm 
product computations in the conditionals. 



VI. DISCUSSION 

The error achieved by SpAMM(4 x 4, r < 2.0 x 10~ 8 ) 
(Figs. [6] and [7| is lower than SGEMM. We attribute this 



behavior to superior order of operation associated with 
hierarchical summation of equi-magnitude blocks. A re- 
lated problem for future work involves establishing pre- 
cise relationships between the SpAMM approximation, ma- 
trix decay and application specific measures of error. In 
addition, our implementation can be readily extended 
to double-precision; however, our goal in this work is to 
achieve full single-precision (or better) and an early onset 
of linear scaling. 

The storage complexity for dense matrices is O (n 2 ) 
but 0(n) in the sparsified case for matrices exhibiting 
approximately exponential decay. The computational 
complexity for convolution is O (n 2 logn 2 ) in the dense 
and O (nlogn) in the sparsified case due to either oc- 
tree recursion or index sorting in the symbolic multiply, 
see Fig. [5] The computational complexity of the numeric 
multiply is 0(n) in either case, when decay is fast enough, 
or the matrix threshold is large enough. For the matrices 
studied here, we find that the numeric multiply domi- 
nates a small but admittedly quadratic component from 
the symbolic kernel. Nevertheless we observe an approx- 
imately quadratic increase in effective performance, as 
shown in Fig. [81 with an approximately linear increase in 
cycle count, as shown in Figs. [9] and 10 



Comparing the cycle counts shown in Figs. [9] and [T0| 
between SpAMM(4 x 4) and SpAMM(16 x 16), SpAMM(4 x 4) 
holds a slight advantage over SpAMM(16 x 16) for both ba- 
sis sets. We attribute differences to variations between 
matrix structures; in the STO-2G basis the submatrices 
of H and O are small and SpAMM(4 x 4) is able to more 
fully exploit the matrix structure leading to a bigger per- 
formance advantage, while in the 6-31G** basis, the sub- 
matrices of H and O are larger, and SpAMM(16 x 16) is 
more appropriate. We find that the corresponding flop 
count ratios, flopi6/flop4, (shown in Figs. 11 and 12 to 
be significantly larger than the cycle count ratios, T\q/T±, 
clearly indicating the difficulty of translating the reduced 
complexity of fine grained algorithms into performance 
gains on commodity platforms. It should be noted that 
both numeric kernels make full use of SSE, and the issue 
is not one of inefficient use of floating point resources. 

The significant complexity reduction shown in Figs.[TT] 
and [12] as well as the early onset of linear scaling shown 
[9] and 



in Figs 
granularities 



10 



underscore the importance of fine 
However, the corresponding irregular ac- 
cess prevents the hardware prefetch mechanism in the 
CPU from activating, resulting in a performance loss 
due to higher memory latencies, partially offsetting the 
performance gains due to complexity reduction (2-3x in 
this case). So far we have been unsuccessful in closing 
this performance gap by using software prefetch state- 
ments and 16 x 16 blocks, as additional conditionals and 
the required large distance between prefetch and data 
lead to no improvement. Indeed, it appears that op- 
timal placement of software prefetch is not only diffi- 
cult, but that the performance gains are generally limited 

[niHaisaiisDi. 

Recently much effort has been dedicated to the co- 
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design of improved prefetch mechanisms, see for in- 
stance the "Scheduled Region Prefetch" [Ml [138], "Flux 
Cache" [62], "Traversal Cache" [H7], or the "Block 
Prefetch Operation" [S3]. These works support and un- 
derscore our conclusion that it is the mechanism of mem- 
ory access, and not the memory hierarchy per se that is 
the bottle neck for fine grained irregular algorithms in- 
cluding SpAMM(4 x 4), related iV-body algorithms, as well 
as the generic pointer chasing problem. 



VII. CONCLUSIONS 

We have developed an optimized single-precision im- 
plementation of the SpAMM algorithm using linkless trees 
and SSE intrinsics in combination with a hand-coded as- 
sembly kernel, and demonstrated early cross-overs with 
vendor tuned SGEMMs and an early onset of linear scal- 
ing for dense quantum chemical matrices. For small val- 
ues of the product space threshold r, we find SpAMM ex- 
hibits superior error control relative to SGEMM. While this 
is unsurprising given the O (n 2 ) error scaling for classi- 
cal matrix-matrix multiplication relative to an O (n log n) 
scaling for recursive multiplication [53], it is worth not- 
ing that our fast implementation is doing something else 
entirely, namely norm sorting prior to convolution (as 
described in Section III D 2 ) . The error analysis of this 



approach, as well as for approximate recursive multipli- 
cation of SFC clustered matrices, is an interesting and 
open problem. Importantly, the issue has been refocused 
on achieving performance and error control that is supe- 
rior to the classical multiply, rather than on the errors 
associated with vector space truncation and their accu- 
mulation in classical multiplication. 

In this implementation 4 , the matrices are stored in 
contiguous chunks of size 16 x 16, which is unusually small 
compared to other high performance matrix multiplies. 
In addition, the norm condition, Eq. (pf, is applied to 
4x4 submatrices giving the implementation fine grained 
error control with a remaining potential for 2-3x gains in 
performance on future hardware with improved prefetch. 
We speculate that these gains might be unlocked with 
hardware modifications such as the Block Prefetch Op- 
eration [33]. Nevertheless, relative to naive implementa- 
tions of SpAMM using vendor tuned versions of SGEMM, such 
as those found in Intel's Math Kernel Library (MKL) or 
AMD's Core Math Library (ACML), our optimized version 
is found to be significantly faster. 

Matrix-matrix multiplication is now aligned with the 
TV-Body programming models employed by other quan- 
tum chemical solvers [351 [3J3 HO] , and more broadly with 
the generalized TV-Body framework [731 E2 US EI] • As 



such, its worth noting that SFC ordering in the vector 
and product space, shown in (C) and (D) of Fig. [I] may 
also provide a mechanism for domain decomposition and 
load balance, corresponding to methods that have been 
already proven for other parallel irregular TV-Body prob- 
lems |32l H31 11391 1143] . Likewise, overdecomposition 
of the recursive, three-dimensional SpAMM task space may 
provide significant opportunities for parallelism through 
advanced middleware such as charm++ ^92J or tascel 
[1021 1106] . as well as through task parallel program- 
ming models supported by Cilk Plus [2] or version 3.0 
of the OpenMP standard ^\ . The low level SIMD opti- 
mizations based on linkless trees presented here also may 
be extended to evolving vector lengths through compiler 
auto-vectorization and SIMD language extensions such 
as those provided by the ispc compiler [3], with applica- 
bility to AVX, Intel Xeon Phi architectures and CPUs. 
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Appendix A: Effective Performance 

Typically the performance of a solver implementa- 
tion is measured by the number of CPU cycles (seconds 
of CPU time), or the number of floating point opera- 
tions (flop) per cycle (second). While the former di- 
rectly measures "time to solution", the latter gives in- 
sight into the efficiency of the implementation execut- 
ing computation instructions under the constraints of 
memory access, register pressure, and other factors. Be- 
cause of the reduction in flops through Eq. ^ , the per- 
ceived performance of SpAMM (inverse time to solution) 
is different than its measured floprate. For comparisons 
with SGEMM, we therefore model the number of flops by 
F = m [k(l + 2ri) — n] and express the effective perfor- 
mance as the ratio of F and the number of CPU cycles 
(seconds). 
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